Skip to content

Conversation

@AliRana30
Copy link

@AliRana30 AliRana30 commented Jan 30, 2026

Rationale for This Change

The SimpleTable constructor in cpp/src/arrow/table.cc can crash if the first column provided is null. This happens because the constructor attempts to dereference columns[0] to determine the row count before performing any validation.
This change ensures safe initialization when columns are empty or the first column is null.


What Changes Are Included in This PR?

A null-pointer safety check has been added to both SimpleTable constructors to prevent dereferencing a null column during initialization:

if (columns.size() == 0 || !columns[0]) {
  num_rows_ = 0;
} else {
  num_rows_ = columns[0]->length();
}

Are These Changes Tested?

Yes. The change has been verified using the existing unit test suite.

Are There Any User-Facing Changes?

No. This change only improves internal safety and does not affect public APIs or user-visible behavior.

@github-actions
Copy link

⚠️ GitHub issue #49079 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added the awaiting review Awaiting review label Jan 30, 2026
@AliRana30
Copy link
Author

AliRana30 commented Jan 31, 2026

Hi @kou ,can you have a look at this ??

@WillAyd
Copy link
Contributor

WillAyd commented Jan 31, 2026

Can you add a test for this?

@kou
Copy link
Member

kou commented Feb 1, 2026

I think that we must not accept columns[0] == nullptr. Caller must provide all valid chunked arrays or arrays.

If we want to check nullptr, we should do it in Table::Make() something like the following:

diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc
index 68a8a1951f..2fd51d858e 100644
--- a/cpp/src/arrow/table.cc
+++ b/cpp/src/arrow/table.cc
@@ -260,12 +260,18 @@ std::vector<std::shared_ptr<Field>> Table::fields() const {
 std::shared_ptr<Table> Table::Make(std::shared_ptr<Schema> schema,
                                    std::vector<std::shared_ptr<ChunkedArray>> columns,
                                    int64_t num_rows) {
+  if (std::any_of(columns.begin(), columns.end(), [](const auto& column) {return bool(column);})) {
+    return Status::Invalid(...);
+  }
   return std::make_shared<SimpleTable>(std::move(schema), std::move(columns), num_rows);
 }
 
 std::shared_ptr<Table> Table::Make(std::shared_ptr<Schema> schema,
                                    const std::vector<std::shared_ptr<Array>>& arrays,
                                    int64_t num_rows) {
+  if (std::any_of(arrays.begin(), arrays.end(), [](const auto& array) {return bool(array);})) {
+    return Status::Invalid(...);
+  }
   return std::make_shared<SimpleTable>(std::move(schema), arrays, num_rows);
 }

@AliRana30
Copy link
Author

Hi @kou I have updated the PR to follow your suggestion. I've moved the validation to Table::Make using DCHECK to ensure that callers provide valid (non-null) arrays. I also reverted the change in the SimpleTable constructor to keep it consistent with the "no-nulls-accepted" policy.

Additionally, I've included a fix for the macOS CI failures caused by the missing pkg-config@0.29.2 formula in Homebrew.

@kou
Copy link
Member

kou commented Feb 1, 2026

Why did you use DCHECK instead of arrow::Status? We can use arrow::Status by changing return type of Table::Make to Result<std::shared_ptr<Table>> from std::shared_ptr<Table>. DCHECK is evaluated only with debug build. Is it intentional?

Additionally, I've included a fix for the macOS CI failures caused by the missing pkg-config@0.29.2 formula in Homebrew.

Could you work on it as a separated task (open a separated issue and PR) instead of mixing this PR? It's not related to this issue.

@AliRana30
Copy link
Author

AliRana30 commented Feb 1, 2026

Hi @kou,

Thank you for the feedback!

Regarding Table::Make and DCHECK:
You are absolutely right. I have refactored Table::Make (and Table::MakeEmpty) to return
Result<std::shared_ptr<Table>> instead of std::shared_ptr<Table>.

I have also replaced the DCHECK statements with explicit Status::Invalid returns to ensure validation happens in both debug and release builds. All call sites across the codebase have been updated to handle the new Result type, primarily using ARROW_ASSIGN_OR_RAISE, and .ValueOrDie() where appropriate in tests.

Regarding the macOS CI fix:
Apologies for mixing these changes in the same PR. I have removed the pkg-config related fix from this PR and will submit it as a separate issue and PR, as requested, to keep this refactor focused.

Please let me know if there are any other areas that need adjustments. Thanks again!

@AliRana30
Copy link
Author

Hi @WillAyd,

I've added the requested tests in cpp/src/arrow/table_test.cc to verify the fix.

Specifically, I added:

  • TestTable.MakeInvalidInputs: Verifies that Table::Make properly returns Status::Invalid (instead of crashing) when passed vectors containing nullptr.
  • Additional assertions in the AddColumn and SetColumn tests to ensure they also reject null inputs with Status::Invalid.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou Do you think that this std::shard_ptr<Table> to Result<std::shared_ptr<Table>> of Table::Make() backward incompatible change is acceptable?

}
std::shared_ptr<Schema> key_schema = schema(std::move(key_fields));
std::shared_ptr<Table> key_table = Table::Make(std::move(key_schema), key_columns);
std::shared_ptr<Table> key_table = Table::Make(std::move(key_schema), key_columns).ValueOrDie();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use ARROW_ASSIGN_OR_RAISE()?

Suggested change
std::shared_ptr<Table> key_table = Table::Make(std::move(key_schema), key_columns).ValueOrDie();
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<Table> key_table, Table::Make(std::move(key_schema), key_columns));

@AliRana30
Copy link
Author

AliRana30 commented Feb 2, 2026

Thank you for the feedback! I see there is some uncertainty around whether changing to Result<std::shared_ptr<Table>> is an acceptable backward-incompatible change.

Clarifying the expected approach:

  • Should Table::Make() continue to return Result<std::shared_ptr<Table>> (accepting the backward-incompatible
    change) with updates across all language bindings?
  • Or should it continue returning std::shared_ptr<Table> and return nullptr on error with appropriate logging?

I noticed the suggested approach uses Status::Invalid, which implies returning Result<>. Before proceeding, I wanted to confirm the preferred direction since this would affect Python, R, Ruby, and MATLAB bindings.

Minor note:
I believe there is a small typo in the suggested lambda. It should be: [](const auto& column) { return !column; } to correctly check for null values.

What should I go for ):

@kou
Copy link
Member

kou commented Feb 2, 2026

Let's wait for opinions from some C++ maintainers.

@wgtmac
Copy link
Member

wgtmac commented Feb 3, 2026

IMHO, these two Table::Make functions look more like internal functions to return a table instance from good schema and arrays. Should we add comments about its intended behavior instead of introducing breaking changes? We can go through a deprecation process like this:

  1. In the table.h, define two internal functions like below and move current implementations to them to expect good schema and arrays as input.
namespace arrow::internal {

static std::shared_ptr<Table> MakeTable(
    std::shared_ptr<Schema> schema, std::vector<std::shared_ptr<ChunkedArray>> columns,
    int64_t num_rows = -1);

static std::shared_ptr<Table> MakeTable(std::shared_ptr<Schema> schema,
                                        const std::vector<std::shared_ptr<Array>>& arrays,
                                        int64_t num_rows = -1);

}  // namespace arrow::internal
  1. Adapter current Table::Make functions to call the above and migrate all other references of Table::Make to call arrow::internal::MakeTable.

  2. Mark current Table::Make as deprecated. Add new Table::MakeTable to return Result<std::unique_ptr<Table>>.

return Status::Invalid("Schema and columns have different number of fields: ",
schema->num_fields(), " vs ", columns.size());
}
for (const auto& column : columns) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use std::ranges::any_of(I suggest to use it but I didn't check whether the version of arrow's C++20 compiler provides this method) or std::any_of.

@pitrou
Copy link
Member

pitrou commented Feb 3, 2026

The SimpleTable constructor in cpp/src/arrow/table.cc can crash if the first column provided is null.

Why is that a problem? You're not supposed to pass a null pointer her.

@AliRana30
Copy link
Author

AliRana30 commented Feb 3, 2026

Thank you all for the feedback! I see there are different perspectives on the best approach:

@pitrou - You mentioned "You're not supposed to pass a null pointer here." I agree that ideally callers should validate inputs. The issue was raised because there are code paths (like in some tests) where null columns could reach SimpleTable constructor, causing crashes. Should we close this as "won't fix" and instead add documentation that callers must validate?

@wgtmac - Your deprecation path suggestion makes sense to avoid breaking changes. Would this be the preferred approach if we do decide to add validation?

@HuaHuaY - Noted on using std::any_of for cleaner null checking. I can apply that if we proceed with validation.

Question for team: Should I:

1.Close this PR (accept null inputs are caller responsibility)
OR
2.Implement @wgtmac's deprecation path
&
3.Keep current approach with std::any_of improvement

I found it(the null pointer deference one) as an issue but I will do as you all will say ):

@pitrou
Copy link
Member

pitrou commented Feb 3, 2026

The issue was raised because there are code paths (like in some tests) where null columns could reach SimpleTable constructor, causing crashes.

Are these code paths in the Arrow C++ codebase? If so, they should be fixed.

Should we close this as "won't fix" and instead add documentation that callers must validate?

Well, unless the docstring for an API states that a pointer is optional, it should always be non-null. This is a common convention in C and C++ codebases, I don't think it's worth mentioning it in each and every docstring.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants